KnowledgeMiner
Knowledge Discovery from Data - an Outlook
 
 
 
 
Decision making in every field of human activity requires problem detection, in addition to a decision maker's feeling that a problem exists or that something is wrong. Models are the basis for every decision. It is worth building models to aid decision making for the following reasons:

models make it possible

  • to recognize the structure and function of complicated objects (subject of identification), which leads to a deeper understanding of the problem; models can usually be analysed more readily than the original object;
  • to find appropriate means for exercising an active influence on the objects (subject of control);
  • to predict the future behavior of the respective objects (subject of prediction), and also to experiment with models and thus answer "what-if" questions.

Therefore, mathematical modeling forms the core of almost all decision support systems.

Models can be derived from existing theory (theory driven approach or theoretical systems analysis) and/or from data (data driven approach or experimental systems analysis).

a. theory driven approach

For complex ill-defined systems, such as economic, ecological, social, biological, and other systems, we have insufficient a priori knowledge about the relevant theory of the system under study. Modeling based on the theory driven approach is considerably hampered by the fact that the modeler often has to know things about the system that are generally impossible to find out. This concerns uncertain a priori information regarding the selection of the model structure (factors of influence and functional relations) as well as insufficient knowledge about interference factors (actual interference factors and factors of influence which cannot be measured). Correspondingly, the insufficient a priori information concerns the required knowledge about the object under study, namely:

  • the main factors of influence (input variables) as well as the classification of variables as endogenous or exogenous;
  • the functional form of the relation between the variables including the dynamic specification of the model;
  • the description of errors such as their correlation structure.

To overcome these problems and to deal with ill-defined systems, and in particular with insufficient a priori knowledge, ways must be found, with the help of emergent information engineering, to shorten the time- and resource-intensive model formation process that is required before initial task solving can begin. Computer-aided design of mathematical models may soon prove highly valuable in bridging this gap.

In general, if only minimal a priori information about an ill-defined, complex system is available and the user is nevertheless interested in results in his field, the data driven approach is a powerful way to generate the required models.

b. data driven approach

Modern information technologies deliver a flood of data, and the question is how to leverage it. Commonly, statistically based principles are used for model formation. They require, however, a priori knowledge about the structure of the mathematical model.
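As a small illustration of this point, the Python sketch below fits a polynomial whose degree, i.e. the model structure, must be chosen by the modeler in advance; the statistical fitting only supplies the parameters. The code is a generic illustration, not part of KnowledgeMiner.

    import numpy as np

    # Observations of an unknown process.
    x = np.linspace(0.0, 1.0, 50)
    y = np.sin(2.0 * np.pi * x) + np.random.default_rng(0).normal(0.0, 0.1, 50)

    # The structure (here: the polynomial degree) is an a priori decision
    # of the modeler; least squares only estimates its coefficients.
    degree = 3
    coefficients = np.polyfit(x, y, degree)
    print(coefficients)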

In addition to the epistemological problems of commonly used statistical principles of model formation, methodological problems arise from the insufficiency of a priori information. This indeterminate starting position, marked by the subjectivity and incompleteness of the theoretical knowledge and by an insufficient data basis, leads to the methodological problems described in [Lemke/Müller, 1997].

Knowledge discovery from data, and specifically data mining techniques and tools, can assist humans in analyzing these mountains of data and in turning the information they contain into successful decisions.

Data mining comprises not a single analytical technique but a process in which many methods and techniques may be appropriate, depending on the nature of the inquiry. These methods include data visualization, tree-based methods, and methods of mathematical statistics, as well as methods for knowledge extraction from data using self-organizing modelling.

Data mining is an interactive and iterative process comprising numerous subtasks and decisions, such as data selection and pre-processing, choice and application of data mining algorithms, and analysis of the extracted knowledge. Most important for a more sophisticated data mining application is to limit the involvement of users in the overall process to the inclusion of existing a priori knowledge, while making the process more automated and more objective.

Automatic model generation methods such as GMDH and Analog Complexing are built on these demands and sometimes provide the only way to generate models of ill-defined problems.
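To make the idea of self-organizing modeling more concrete, the following Python sketch implements one layer of a simple GMDH pass: every pair of inputs is combined in a small partial model fitted by least squares, and only the candidates that perform best on an external validation set survive. The names, the form of the partial model, and the selection criterion are illustrative simplifications, not the actual KnowledgeMiner implementation.

    import itertools
    import numpy as np

    def fit_partial_model(x1, x2, y):
        # Least-squares fit of the partial model y = a0 + a1*x1 + a2*x2 + a3*x1*x2.
        A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coef

    def predict_partial(coef, x1, x2):
        A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
        return A @ coef

    def gmdh_layer(X_train, y_train, X_valid, y_valid, keep=4):
        # Fit one partial model per input pair; keep the candidates with
        # the smallest error on the external (validation) data set.
        candidates = []
        for i, j in itertools.combinations(range(X_train.shape[1]), 2):
            coef = fit_partial_model(X_train[:, i], X_train[:, j], y_train)
            pred = predict_partial(coef, X_valid[:, i], X_valid[:, j])
            candidates.append((float(np.mean((pred - y_valid) ** 2)), i, j, coef))
        candidates.sort(key=lambda c: c[0])
        return candidates[:keep]

    # Toy usage: split the data into a training and a validation part.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 2.0 * X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=100)
    survivors = gmdh_layer(X[:60], y[:60], X[60:], y[60:])
    best_mse, i, j, _ = survivors[0]
    print(f"best input pair: ({i}, {j}), validation MSE: {best_mse:.4f}")

In a full multilayer run, the outputs of the surviving partial models would serve as inputs to the next layer, and the self-organization would stop once the external criterion no longer improves.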

 

Outlook

To extend the modeling spectrum of KnowledgeMiner and to realize the concept of Knowledge Discovery in a more sophisticated way, mainly two further outstanding technologies are currently under development:

Self-organizing fuzzy modelling

Rule based modeling is one direction in automatic model generation; it can be realized by means of binary logic as well as by means of fuzzy logic. More important for ill-defined applications, such as those in economics, ecology, or sociology, will be self-organizing fuzzy modeling using a GMDH algorithm that is able, for example, to generate a fuzzy model from a given data set. Fuzzy modeling can be interpreted as a qualitative modeling scheme that describes system behavior qualitatively, using a natural language. Zadeh suggested the idea of fuzzy sets after encountering difficulties in identifying complex systems by means of differential equations with numeric parameters.

Fuzzy modelling is an approach that forms a system model using a description language based on fuzzy logic with fuzzy predicates. Such a description can characterize system behavior qualitatively, using fuzzy quantities.
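A minimal Python sketch of what such a fuzzy description might look like, assuming triangular membership functions and a simple two-rule inference with weighted-average defuzzification; the variable, the rules, and all numeric values are invented for illustration only:

    import numpy as np

    def triangular(x, a, b, c):
        # Membership degree of x in the triangular fuzzy set (a, b, c).
        return float(np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0))

    def infer(temperature):
        # Fuzzy predicates: "temperature is low" / "temperature is high".
        low = triangular(temperature, 0.0, 10.0, 20.0)
        high = triangular(temperature, 15.0, 25.0, 35.0)
        # Two rules: IF low THEN heating = 80; IF high THEN heating = 20,
        # combined by weighted-average defuzzification.
        return (low * 80.0 + high * 20.0) / (low + high + 1e-12)

    print(infer(12.0))  # mostly "low"  -> heating near 80
    print(infer(24.0))  # mostly "high" -> heating near 20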

 

Objective Cluster Analysis

The purpose of the Objective Cluster Analysis algorithm is to automatically subdivide a given data set into groups of data with similar characteristics (classification). The optimal number of clusters, their width, and their composition are selected automatically by the algorithm. A clusterization can be considered a model of an object expressed in a fuzzy language that describes relationships.
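One way to illustrate the idea of an objective choice of the number of clusters (a strong simplification for illustration, not the actual Objective Cluster Analysis algorithm) is to cluster two halves of the data separately and prefer the cluster count on which both halves agree best. The Python sketch below uses a tiny k-means as a generic clustering engine:

    import numpy as np

    def kmeans(X, k, iterations=50, seed=0):
        # A tiny k-means, used here only as a generic clustering engine.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iterations):
            labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return centers

    def consistency(X, k):
        # Disagreement of the cluster centers found on two halves of the data.
        half = len(X) // 2
        idx = np.random.default_rng(1).permutation(len(X))
        centers_a = np.sort(kmeans(X[idx[:half]], k), axis=0)
        centers_b = np.sort(kmeans(X[idx[half:]], k), axis=0)
        return float(np.linalg.norm(centers_a - centers_b))

    # Toy data: three well-separated groups in two dimensions.
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 3.0, 6.0)])

    scores = {k: consistency(X, k) for k in range(2, 6)}
    print(scores)                       # smaller value = more consistent split
    print(min(scores, key=scores.get))  # candidate for the "objective" k

Roughly speaking, this plays the role of an external criterion: the clusterization is judged on data that was not used to construct it.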

In conjunction with model building, cluster or pattern analysis is important because its results are homogeneous parts of the data that can serve as a basis for developing models of complex systems.

